Topology-Aware Parallelism for NUMA Copying Collectors
Abstract
NUMA-aware parallel algorithms in runtime systems attempt to improve locality by allocating memory from local NUMA nodes. Researchers have suggested that the garbage collector should profile memory access patterns or use object locality heuristics to determine the target NUMA node before moving an object. However, these solutions are costly when applied to every live object in the reference graph. Our earlier research suggests that connected objects, represented by rooted sub-graphs, provide abundant locality and are well suited to NUMA architectures. In this paper, we use the intrinsic locality of rooted sub-graphs to improve parallel copying collector performance. Our new topology-aware parallel copying collector preserves rooted sub-graph integrity by moving the connected objects as a unit to the target NUMA node. In addition, it distributes the copying tasks to appropriate (i.e., NUMA-node-local) GC threads. For load balancing, our solution enforces locality in the work-stealing mechanism by stealing only from threads on the local NUMA node. We evaluated our approach on SPECjbb2013, DaCapo 9.12, and Neo4j. Results show a GC speedup of up to 2.5x and 37% better application performance.
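The task distribution and node-local work stealing described in the abstract can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not the collector's actual implementation: the names `GCThread`, `assign_tasks`, `steal_local`, and `run` are hypothetical, each task stands in for one rooted sub-graph that is copied as a unit, and the round-robin assignment policy is an assumption made for illustration.

```python
from collections import deque


class GCThread:
    """A simulated GC worker bound to one NUMA node (hypothetical model)."""

    def __init__(self, tid, node):
        self.tid = tid
        self.node = node
        self.tasks = deque()  # each task represents one rooted sub-graph
        self.copied = []      # (root, target_node) pairs this thread processed


def assign_tasks(threads, subgraphs):
    """Hand each rooted sub-graph to a GC thread on its target NUMA node."""
    by_node = {}
    for t in threads:
        by_node.setdefault(t.node, []).append(t)
    next_slot = {node: 0 for node in by_node}
    for root, target_node in subgraphs:
        local = by_node[target_node]
        # Assumed policy: round-robin among the threads of the target node.
        worker = local[next_slot[target_node] % len(local)]
        next_slot[target_node] += 1
        worker.tasks.append((root, target_node))


def steal_local(thief, threads):
    """Steal one task, but only from a victim on the thief's own NUMA node."""
    for victim in threads:
        if victim is not thief and victim.node == thief.node and victim.tasks:
            # The owner pops from the right, so the thief takes from the left.
            return victim.tasks.popleft()
    return None


def run(threads):
    """Round-based simulation: one task per thread per round, stealing when idle."""
    progress = True
    while progress:
        progress = False
        for t in threads:
            task = t.tasks.pop() if t.tasks else steal_local(t, threads)
            if task is not None:
                t.copied.append(task)
                progress = True


# Example: two NUMA nodes with two GC threads each.
threads = [GCThread(0, 0), GCThread(1, 0), GCThread(2, 1), GCThread(3, 1)]
subgraphs = [("r%d" % i, 0) for i in range(5)] + [("s0", 1)]
assign_tasks(threads, subgraphs)
run(threads)
```

In this model an idle thread on one node never takes work queued on another node, so every sub-graph is processed by a thread local to its target node. The cost of that restriction is that a node can sit idle while another is still busy.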
Similar resources
Impact of Numa Effects on High-speed Networking with Multi-opteron Machines
The ever-growing level of parallelism within the multi-core and multi-processor nodes in clusters leads to the generalization of distributed memory banks and busses with nonuniform access costs. These NUMA effects have been mostly studied in the context of threads scheduling and are known to have an influence on high-performance networking in clusters. We present an evaluation of their impact o...
Evaluation of OpenMP Task Scheduling Algorithms for Large NUMA Architectures
Current generation of high performance computing platforms tends to hold a large number of cores. Therefore applications have to expose a fine-grain parallelism to be more efficient. Since version 3.0, the OpenMP standard proposes a way to express such parallelism through tasks. Because the task scheduling strategy is implementation defined, each runtime can have a different behavior and effici...
Performance Evaluation of HPC Benchmarks on VMware's ESXi Server
A major obstacle to virtualizing HPC workloads is a concern about the performance loss due to virtualization. We will demonstrate that new features significantly enhance the performance and scalability of virtualized HPC workloads on VMware’s virtualization platform. Specifically, we will discuss VMware’s ESXi Server performance for virtual machines with up to 64 virtual CPUs as well as support...
Collecting Network Status Information for Network-Aware Applications
Network-aware applications, i.e., applications that adapt to network conditions in an application specific way, need both static and dynamic information about the network to be able to adapt intelligently to network conditions. The CMU Remos interface gives applications access to a wide range of information in a network-independent fashion. Remos uses a logical topology to capture the network i...
Garbage Collection Alternatives for Icon
Copying garbage collectors are becoming the collectors of choice for very high-level languages and for functional and object-oriented languages. Copying collectors are particularly efficient for large storage regions because their execution time is proportional only to the amount of accessible data, and they identify and compact this data in one pass. In contrast, mark-and-sweep collectors exec...